46 research outputs found

    Should we use movie subtitles to study linguistic patterns of conversational speech? A study based on French, English and Taiwan Mandarin

    Linguistic research benefits from the wide range of resources and software tools developed for natural language processing (NLP) tasks. However, NLP has a strong historical bias towards written language, which often makes these resources and tools inadequate for research questions about the linguistic patterns of spontaneous speech. In this preliminary study, we investigate whether corpora of movie and TV subtitles can be used to estimate data-driven NLP models adapted to conversational speech. In particular, the presented work explores lexical and syntactic distributional aspects across three genres (conversational, written and subtitles) and three languages (French, English and Taiwan Mandarin). Ongoing work focuses on comparing these three genres on the basis of deeper syntactic conversational patterns, using graph-based modelling and visualisation.

    Can MDL Improve Unsupervised Chinese Word Segmentation?

    It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower Description Length does not necessarily correspond to better segmentation results. Finally, we show that we can use very basic linguistic knowledge to coerce the MDL towards a linguistically plausible hypothesis and obtain better results than any previously proposed method for unsupervised Chinese word segmentation with minimal human effort.
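To make the notion of Description Length concrete, here is a minimal sketch of a generic two-part code (lexicon cost plus corpus cost) for a segmented corpus. It is an assumption made for illustration, not the formulation used in the paper: the function name `description_length`, the unigram code and the uniform character code are all hypothetical choices.

```python
import math
from collections import Counter


def description_length(segmented_corpus):
    """Two-part description length, in bits, of a segmented corpus.

    Assumption: a plain unigram code over word tokens plus a uniform
    character code for the lexicon. This is NOT the paper's formulation.
    """
    tokens = [w for sentence in segmented_corpus for w in sentence]
    counts = Counter(tokens)
    total = len(tokens)

    # Corpus cost: each token costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())

    # Lexicon cost: each word type is spelled character by character,
    # plus one end-of-word marker, under a uniform code over the alphabet.
    alphabet = {ch for w in counts for ch in w}
    bits_per_char = math.log2(len(alphabet) + 1)
    lexicon_bits = sum((len(w) + 1) * bits_per_char for w in counts)

    return corpus_bits + lexicon_bits


# Two candidate segmentations of the same toy string "abababab".
as_words = [["ab", "ab", "ab", "ab"]]
as_chars = [["a", "b", "a", "b", "a", "b", "a", "b"]]
```

On this toy input the word-level segmentation gets the shorter description. As the abstract stresses, a shorter description is not guaranteed to be the better segmentation on real data.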

    BACANAL : Balades Aléatoires Courtes pour ANAlyses Lexicales Application à la substitution lexicale

    In this paper, we propose word sense disambiguation methods based on lexical substitution, used for task 1 of the SemDis2014 workshop. These methods all rely on short random walks on unipartite or bipartite graphs built from various resources. Some of these methods only use graphs automatically built from corpora (unsupervised methods); others also use graphs built from handcrafted resources produced by lexicographers or by the crowd (supervised methods). Keywords: word sense disambiguation, lexical substitution, lexical networks, short random walks
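As a rough illustration of what a short random walk on a lexical graph computes, the sketch below propagates probability mass from a start word for a few steps and returns the resulting distribution over words. It is a generic sketch, not the BACANAL implementation: the function `short_walk_distribution`, the toy graph and the choice of a uniform transition probability are assumptions made here.

```python
def short_walk_distribution(graph, start, steps=3):
    """Probability of being at each node after `steps` random-walk steps.

    `graph` maps each node to a list of its neighbours. Assumption: every
    node has at least one neighbour and each neighbour is chosen uniformly.
    """
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, p in dist.items():
            neighbours = graph[node]
            share = p / len(neighbours)  # uniform split over neighbours
            for nb in neighbours:
                nxt[nb] = nxt.get(nb, 0.0) + share
        dist = nxt
    return dist


# Hypothetical toy synonymy graph (undirected, stored as adjacency lists).
toy_graph = {
    "big": ["large", "great"],
    "large": ["big", "huge"],
    "great": ["big"],
    "huge": ["large"],
}

two_step = short_walk_distribution(toy_graph, "big", steps=2)
```

Words that receive high probability after a few steps are close to the start word in the graph, which is one way substitution candidates could be ranked.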

    Graph representation of synonymy and translation resources for cross-linguistic modelisation of meaning


    Using Extra-Linguistic Material for Mandarin-French Verbal Constructions Comparison

    PACLIC 23 / City University of Hong Kong / 3-5 December 200

    Using Extra-Linguistic Material for Mandarin-French Verbal Constructions Comparison

    Systematic cross-linguistic studies of verbs' syntactic-semantic behaviors for typologically distant languages such as Mandarin Chinese and French are difficult to conduct. Such studies are nevertheless necessary due to the crucial role that verbal constructions play in the mental lexicon. This paper addresses the problem by combining psycho-linguistics and computational methods. Psycho-linguistics provides us with a bilingual corpus that features verbal constructions associated with carefully built extra-linguistic material (short video clips). Computational approaches bring us distributional semantic models (DSM) to measure the distance between linguistic elements in the extra-linguistic space. These models allow for cross-linguistic measures that we evaluate against manually annotated data. In this paper, we discuss the results, potential shortcomings involving cultural variability and how to measure such bias.

    Skillex, an action labelling efficiency score: the case for French and Mandarin

    We propose a model to compute two measurements of semantic efficiency of verbs as action labels. It is based on the exploration of the specific structure of synonymy networks of verbs. We use these measurements to analyse and compare the semantic efficiency of [Children/Adults] productions in action labelling tasks, in French and Mandarin. The combination of these two measurements leads to a generic score of semantic efficiency, Skillex. Assigned to participants of the Approx protocol experiment, this score enables us to accurately classify them into Children and Adults categories, be they French or Mandarin native speakers.

    Wiktionary and NLP: Improving synonymy networks

    Wiktionary, a satellite of the Wikipedia initiative, can be seen as a potential resource for Natural Language Processing. However, it needs to be processed before it can be used efficiently as an NLP resource. After describing the aspects of Wiktionary relevant to our purposes, we focus on its structural properties. Then, we describe how we extracted synonymy networks from this resource. We provide an in-depth study of these synonymy networks and compare them to those extracted from traditional resources. Finally, we describe two methods for semi-automatically improving this network by adding missing relations: (i) using a kind of semantic proximity measure; (ii) using translation relations of Wiktionary itself.